{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "# @author: tongzi\n",
    "# @created date: 2019/08/14\n",
    "# @description: featur engineering\n",
    "# @last modification: 2019/10/9\n",
    "# "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import Libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.1分类特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},\n",
    "    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},\n",
    "    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},\n",
    "    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用独热编码（one-hot-encoding）来把分类特征映射到整数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[     0,      1,      0, 850000,      4],\n",
       "       [     1,      0,      0, 700000,      3],\n",
       "       [     0,      0,      1, 650000,      3],\n",
       "       [     1,      0,      0, 600000,      2]], dtype=int32)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "vec = DictVectorizer(sparse=False, dtype=int)\n",
    "vec.fit_transform(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['neighborhood=Fremont',\n",
       " 'neighborhood=Queen Anne',\n",
       " 'neighborhood=Wallingford',\n",
       " 'price',\n",
       " 'rooms']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vec.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "独热编码带来很多冗余（很多的0），可以使用稀疏矩阵进行存储："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<4x5 sparse matrix of type '<class 'numpy.int32'>'\n",
       "\twith 12 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vec = DictVectorizer(sparse=True, dtype=int)\n",
    "vec.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在拟合和评估模型时，Scikit-Learn的许多评估器都支持稀疏矩阵输入。$sklearn.preprocessing.OneHotEncoder$和$sklearn.feature\\_extraction.FeatureHasher$是Scikit-Learn另外两个为分类特征编码的工具。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.2文本特征  \n",
    "词频统计："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = ['problem of evil',\n",
    "'evil queen',\n",
    "'horizon problem']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x5 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 7 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "vec = CountVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果是一个稀疏矩阵，里面记录了每个短语中每个单词的出现次数。用DataFrame表示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>evil</th>\n",
       "      <th>horizon</th>\n",
       "      <th>of</th>\n",
       "      <th>problem</th>\n",
       "      <th>queen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   evil  horizon  of  problem  queen\n",
       "0     1        0   1        1      0\n",
       "1     1        0   0        0      1\n",
       "2     0        1   0        1      0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(X.toarray(), columns=vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不过这种统计方法也有一些问题：原始的单词统计会让一些常用词聚集太高的权重，在分\n",
    "类算法中这样并不合理。解决这个问题的方法就是通过 TF–IDF（term frequency–inverse\n",
    "document frequency，词频逆文档频率），通过单词在文档中出现的频率来衡量其权重\n",
    "3 。计算这些特征的语法和之前的示例类似："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>evil</th>\n",
       "      <th>horizon</th>\n",
       "      <th>of</th>\n",
       "      <th>problem</th>\n",
       "      <th>queen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.517856</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.680919</td>\n",
       "      <td>0.517856</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.605349</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.795961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.795961</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.605349</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       evil   horizon        of   problem     queen\n",
       "0  0.517856  0.000000  0.680919  0.517856  0.000000\n",
       "1  0.605349  0.000000  0.000000  0.000000  0.795961\n",
       "2  0.000000  0.795961  0.000000  0.605349  0.000000"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vec = TfidfVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "pd.DataFrame(X.toarray(), columns=vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.3图像特征  \n",
    "机器学习还有一种常见需求，那就是对图像进行编码。我们在 5.2 节处理手写数字图像时\n",
    "使用的方法，是最简单的图像编码方法：用像素表示图像。但是在其他类型的任务中，这\n",
    "类方法可能不太合适。  \n",
    "  \n",
    "  \n",
    "虽然完整地介绍图像特征的提取技术超出了本章的范围，但是你可以在 Scikit-Image 项目\n",
    "（http://scikit-image.org）中找到许多标准方法的高品质实现。关于同时使用 Scikit-Learn 和Scikit-Image 的示例，请参见 5.14 节。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.4衍生特征  \n",
    "还有一种有用的特征是输入特征经过数学变换衍生出来的新特征。我们在 5.3 节从输入数据中构造多项式特征时，曾经见过这类特征。我们发现将一个线性回归转换成多项式回归时，并不是通过改变模型来实现，而是通过改变输入数据！这种处理方式有时被称为基函数回归（basis function regression），详细请参见 5.6 节。  \n",
    "  \n",
    "例如，下面的数据显然不能用一条直线描述"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAADuVJREFUeJzt3V1sZHd9xvHnqdeUIQm1xI5o1pvW6o2lNpR4O4qCVorShOIEorCiuQgS0CBV2xfUJmplVPeiFb3hwhKiL1LRNqFNS8JLg2OFiMSkCghxwaLZeMEJG1cpCiJ22p1QOS9lBBvz64XHYXewd85k58yZ3+b7kUZ7fM5/5zz6Z+fx8Zn/xI4IAQDy+IWqAwAA+kNxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJLOvjCfdv39/TE1NlfHUAHBROnHixPMRUS8ytpTinpqaUrPZLOOpAeCiZPv7RcdyqwQAkqG4ASAZihsAkqG4ASAZihsAkulZ3LanbZ886/Gi7TuHEQ4A8PN6LgeMiDVJV0mS7TFJ65IeKDkXAKSwtLKuheU1bWy2dWCiprnZaR2ZmSz1nP2u475B0n9FROH1hgBwsVpaWdf84qraZ7YkSeubbc0vrkpSqeXd7z3u2yR9towgAJDNwvLaq6W9o31mSwvLa6Wet3Bx236DpFsk/fsex4/abtputlqtQeUDgJG1sdnua/+g9HPFfZOkxyPif3Y7GBHHIqIREY16vdDH7QEgtQMTtb72D0o/xf1+cZsEAF41Nzut2vjYOftq42Oam50u9byF3py0/SZJvyPpD0pNAwCJ7LwBOZKrSiLiR5LeUmoSAEjoyMxk6UXdjU9OAkAyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJFOouG1P2L7f9lO2T9l+R9nBAAC721dw3N9KeiQibrX9BklvKjETAOA8eha37TdLulbS7ZIUET+R9JNyYwEA9lLkVsmvSWpJ+mfbK7bvsn1J9yDbR203bTdbrdbAgwIAthUp7n2SDkn6x4iYkfR/kv6ie1BEHIuIRkQ06vX6gGMCAHYUKe5nJT0bEcc7X9+v7SIHAFSgZ3FHxH9L+oHt6c6uGyR9t9RUAIA9FV1V8ieS7u2sKPmepA+XFwkAcD6FijsiTkpqlJwFAFAAn5wEgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIZl+RQbafkfSSpC1Jr0REo8xQAIC9FSrujt+OiOdLSwIAKIRbJQCQTNHiDklfsX3C9tHdBtg+artpu9lqtQaXEABwjqLFfTgiDkm6SdJHbF/bPSAijkVEIyIa9Xp9oCEBAD9TqLgjYqPz52lJD0i6usxQAIC99Sxu25fYvmxnW9K7JD1RdjAAwO6KrCp5q6QHbO+Mvy8iHik1FQBgTz2LOyK+J+ntQ8gCACiA5YAAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkEzh4rY9ZnvF9kNlBgIAnN++PsbeIemUpDeXEWRpZV0Ly2va2GzrwERNc7PTOjIzWcapACC1Qlfctg9Keo+ku8oIsbSyrvnFVa1vthWS1jfbml9c1dLKehmnA4DUit4q+aSkj0r6aRkhFpbX1D6zdc6+9pktLSyvlXE6AEitZ3HbvlnS6Yg40WPcUdtN281Wq9VXiI3Ndl/7AeD1rMgV92FJt9h+RtLnJF1v+zPdgyLiWEQ0IqJRr9f7CnFgotbXfgB4PetZ3BExHxEHI2JK0m2SHouIDwwyxNzstGrjY+fsq42PaW52epCnAYCLQj+rSkqzs3qEVSUA0JsjYuBP2mg0otlsDvx5AeBiZftERDSKjOWTkwCQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMn0LG7bb7T9Ldvftv2k7Y8NIxgAYHf7Coz5saTrI+Jl2+OSvmH74Yj4ZsnZAFRkaWVdC8tr2ths68BETXOz0zoyM1l1LHT0LO6ICEkvd74c7zyizFAAqrO0sq75xVW1z2xJktY325pfXJUkyntEFLrHbXvM9klJpyU9GhHHy40FoCoLy2uvlvaO9pktLSyvVZQI3QoVd0RsRcRVkg5Kutr2ld1jbB+13bTdbLVag84JYEg2Ntt97cfw9bWqJCI2JX1N0o27HDsWEY2IaNTr9QHFAzBsByZqfe3H8BVZVVK3PdHZrkl6p6Snyg4GoBpzs9OqjY+ds682Pqa52emKEqFbkVUll0u6x/aYtov+CxHxULmxAFRl5w1IVpWMriKrSr4jaWYIWQCMiCMzkxT1COOTkwCQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMn0LG7bV9j+qu1Ttp+0fccwggEAdrevwJhXJP15RDxu+zJJJ2w/GhHfLTkbzmNpZV0Ly2va2GzrwERNc7PTOjIzWXUsAEPQs7gj4jlJz3W2X7J9StKkJIq7Iksr65pfXFX7zJYkaX2zrfnFVUmivIHXgb7ucduekjQj6XgZYVDMwvLaq6W9o31mSwvLaxUlAjBMhYvb9qWSvijpzoh4cZfjR203bTdbrdYgM6LLxma7r/0ALi6Fitv2uLZL+96IWNxtTEQci4hGRDTq9fogM6LLgYlaX/sBXFyKrCqxpLslnYqIT5QfCb3MzU6rNj52zr7a+JjmZqcrSgRgmIpccR+W9EFJ19s+2Xm8u+RcOI8jM5P6+PvepsmJmixpcqKmj7/vbbwxCbxOFFlV8g1JHkIW9OHIzCRFDbxO8clJAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZPb1GmD705JulnQ6Iq4sPxIweEsr61pYXtPGZlsHJmqam53WkZnJqmMBr0mRK+5/kXRjyTmA0iytrGt+cVXrm22FpPXNtuYXV7W0sl51NOA16VncEfF1Sf87hCxAKRaW19Q+s3XOvvaZLS0sr1WUCLgwA7vHbfuo7abtZqvVGtTTAhdsY7Pd135g1A2suCPiWEQ0IqJRr9cH9bTABTswUetrPzDqWFWCi97c7LRq42Pn7KuNj2ludrqiRMCF6bmqBMhuZ/UIq0pwsSiyHPCzkq6TtN/2s5L+OiLuLjsYMEhHZiYpalw0ehZ3RLx/GEEAAMVwjxsAkqG4ASAZihsAkqG4ASAZihsAknFEDP5J7Zak77/Gv75f0vMDjDMo5OoPufpDrv6MYq4LzfSrEVHoY+elFPeFsN2MiEbVObqRqz/k6g+5+jOKuYaZiVslAJAMxQ0AyYxicR+rOsAeyNUfcvWHXP0ZxVxDyzRy97gBAOc3ilfcAIDzqKy4bX/a9mnbT+xx3Lb/zvbTtr9j+9AIZLrO9gu2T3Yef1V2ps55r7D9VdunbD9p+45dxlQxX0VyDX3ObL/R9rdsf7uT62O7jPlF25/vzNdx21Mjkut2262z5uv3y87VOe+Y7RXbD+1ybOhzVTBXVXP1jO3Vzjmbuxwv/7UYEZU8JF0r6ZCkJ/Y4/m5JD0uypGskHR+BTNdJeqiCubpc0qHO9mWS/lPSr4/AfBXJNfQ568zBpZ3tcUnHJV3TNeaPJX2qs32bpM+PSK7bJf1DBf/G/kzSfbv9t6pirgrmqmqunpG0/zzHS38tVnbFHb1/CfF7Jf1rbPumpAnbl1ecqRIR8VxEPN7ZfknSKUnd/3PpKuarSK6h68zBy50vxzuP7jdz3ivpns72/ZJusO0RyDV0tg9Keo+ku/YYMvS5KphrVJX+Whzle9yTkn5w1tfPagRKQdI7Oj/qPmz7N4Z98s6PqTPavlo7W6XzdZ5cUgVz1vkR+6Sk05IejYg95ysiXpH0gqS3jEAuSfrdzo/Y99u+ouxMkj4p6aOSfrrH8UrmqkAuafhzJW1/s/2K7RO2j+5yvPTX4igX927f0au+Onlc2x9Lfbukv5e0NMyT275U0hcl3RkRL3Yf3uWvDGW+euSqZM4iYisirpJ0UNLVtq/sGlLJfBXI9SVJUxHxm5L+Qz+70i2F7ZslnY6IE+cbtsu+UueqYK6hztVZDkfEIUk3SfqI7Wu7jpc+X6Nc3M9KOvs76EFJGxVlkSRFxIs7P+pGxJcljdveP4xz2x7XdjneGxGLuwypZL565apyzjrn3JT0NUk3dh16db5s75P0SxribbK9ckXEDyPix50v/0nSb5Uc5bCkW2w/I+lzkq63/ZmuMVXMVc9cFczVznk3On+elvSApKu7hpT+Whzl4n5Q0oc679BeI+mFiHiuykC2f3nn3p7tq7U9fz8cwnkt6W5JpyLiE3sMG/p8FclVxZzZrtue6GzXJL1T0lNdwx6U9Hud7VslPRadd5aqzNV1L/QWbb9vUJqImI+IgxExpe03Hh+LiA90DRv6XBXJNey56pzzEtuX7WxLepek7lVopb8WK/st797llxBr+80aRcSnJH1Z2+/OPi3pR5I+PAKZbpX0R7ZfkdSWdFvZ/4A7Dkv6oKTVzv1RSfpLSb9yVrahz1fBXFXM2eWS7rE9pu1vFF+IiIds/42kZkQ8qO1vOP9m+2ltXz3eVnKmorn+1PYtkl7p5Lp9CLl+zgjMVZFcVczVWyU90LkW2Sfpvoh4xPYfSsN7LfLJSQBIZpRvlQAAdkFxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0Ay/w9oplBLS/6cOgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0xb5b5400>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "x = np.array([1, 2, 3, 4, 5])\n",
    "y = np.array([4, 2, 1, 3, 7])\n",
    "plt.scatter(x, y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但是我们仍然用 LinearRegression 拟合出一条直线，并获得直线的最优解："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFVxJREFUeJzt3W1wXOV5xvHrRhYgbBxhW1qQsBAGI1Y4DaICQogdYqSIBIa4aT6QhrRkQtw0aRKaVkzdSZtpZzqdjmcySV+mGTdJmzQvTUqMJ2USlEhACJmGRMY0JpbFW82LRCTZIBuDsCX57ofdNbLQas/aOrvnWf1/Mxqvdo+1Nw/eS0dnz6Vj7i4AQDhOK/cAAIDiENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMAQ3AASG4AaAwCyJ44uuWrXKm5ub4/jSAFCRdu7cud/d66JsG0twNzc3q7+/P44vDQAVycyeiboth0oAIDAENwAEhuAGgMAQ3AAQGIIbAAJTMLjNrMXMHp3xccjM7ijFcACANyp4OqC7D0q6XJLMrErSkKS7Y54LAIKwY9eQtvYManh8Qg21NeruatGmtsZYn7PY87ivl/SUu0c+3xAAKtWOXUPasn23JianJUlD4xPasn23JMUa3sUe475F0rfjGAQAQrO1Z/B4aOdMTE5ra89grM8bObjN7HRJN0v6rzyPbzazfjPrHxsbW6j5ACCxhscnirp/oRSzx/1uSY+4+8hcD7r7Nndvd/f2urpIdXsACFpDbU1R9y+UYoL7A+IwCQAc193VoprqqhPuq6muUndXS6zPG+nNSTM7S1KnpD+MdRoACEjuDchEnlXi7q9KWhnrJAAQoE1tjbEH9Ww0JwEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDAENwAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIGJFNxmVmtmd5nZXjMbMLNr4h4MADC3JRG3+6Kke939/WZ2uqSzYpwJADCPgsFtZsslbZB0myS5+1FJR+MdCwCQT5RDJWskjUn6NzPbZWZfNrOlszcys81m1m9m/WNjYws+KAAgI0pwL5F0haR/cfc2Sa9I+vPZG7n7Nndvd/f2urq6BR4TAJATJbifl/S8uz+c/fwuZYIcAFAGBYPb3X8j6Tkza8nedb2kPbFOBQDIK+pZJZ+U9M3sGSVPS/pwfCMBAOYTKbjd/VFJ7THPAgCIgOYkAASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDAENwAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMEuibGRm+yS9LGla0pS7t8c5FAAgv0jBnfVOd98f2yQAgEg4VAIAgYka3C7pR2a208w2z7WBmW02s34z6x8bG1u4CQEAJ4ga3Ne6+xWS3i3pE2a2YfYG7r7N3dvdvb2urm5BhwQAvC5ScLv7cPbPUUl3S7oqzqEAAPkVDG4zW2pmZ+duS3qXpMfiHgwAMLcoZ5WkJN1tZrntv+Xu98Y6FQAgr4LB7e5PS3pLCWYBgGBNTR9T1Wmm7E5urIo5jxsAMMPBiUn95PEx9Q2M6P69o9r+8Wt1cf2y2J+X4AaAIjx74FX1Doyod2BEv/i/FzV1zLVy6el612XnqgQ725IIbgCY1/Qx16PPjasvG9aPjxyWJK2tX6aPblijjnS9Ll99jqpOK1Fqi+AGgDd49eiUfvrEfvXuGdH9g6Paf/ioqk4zXdW8Qn95U5M60vW6YOXSss1HcAOApN8cfE19e0fUu2dEP3vqgI5OHdPZZy7RO1vqdX26XtddUq83nVVd7jElEdwAFil316+HD6lvYFS9AyPaPXRQktS04izdevUF6mit15XNK1Rdlbxf6URwA1g0jkxN63+eOqDegRH1DYzqhYOvyUy6oukc3XlDizrTKV1cv6wkp/SdCoIbQEU7cPiI7ts7qr6BUT34xJhePTqtmuoqbbhklf6k8xJtvLReq5adUe4xi0JwA6go7q6nxg7rx3tG1Tcwop3PviR36dzlZ+p32hrV0ZrSNWtW6szqqnKPetIIbgDBm5w+pl/ue/H48epnDrwqSVrXuFyf2rhWna0pXdawPPGHQKIiuAEEaXZr8dBrUzq96jS97eKVun39Gl1/ab0aamvKPWYsCG4AwZivtdiRTmn92lVaekblx1rl/xcCCNZ8rcXb169RZ2vpW4tJQHADSJSktxaTgOAGUHb5WovXtdSrI2GtxSQguAGUXK61mCvC5FqLq1fUZFqL6XpdeWEyW4tJQHADKIl8rcW21bW684YWdaRTWhtAazEJCG4AscnXWly/NtzWYhIQ3AAWTL7WYmr5GZnWYjqlay4Ku7WYBAQ3gFOSr7V4WUOmtdiRTmldY+W0FpOA4AZQtFxrsXfPiB4YXFytxSQguAFEMldrccXx1mK91q+tWxStxSRglQHMKV9r8eJF3lpMgsjBbWZVkvolDbn7TfGNBKBc5mstfvbG1epIp9S8anG3FpOgmD3uT0sakLQ8jkF27BrS1p5BDY9PqKG2Rt1dLdrU1hjHUwGYIV9r8R2X1KmzNUVrMYEiBbeZnS/pRkl/K+kzCz3Ejl1D2rJ9tyYmpyVJQ+MT2rJ9tyQR3sACm6+1+MGrm9SZTtFaTLioe9xfkHSnpLPjGGJrz+Dx0M6ZmJzW1p5BghtYAPO1Fru7WtTZSmsxJAWD28xukjTq7jvN7Lp5ttssabMkNTU1FTXE8PhEUfcDKIzWYuWKssd9raSbzew9ks6UtNzMvuHut87cyN23SdomSe3t7V7MEA21NRqaI6Q5DxSIbr7W4qa2RnXSWqwYBYPb3bdI2iJJ2T3uP5sd2qequ6vlhGPcklRTXaXurpaFfBqg4kxOH1P/vpeyh0BGtI/W4qKQiPO4c8exOasEKGy+ay1+hNbiomDuRR3ViKS9vd37+/sX/OsCi1W+1uLGS+tpLVYIM9vp7u1RtuX/NJBAx465Hn1+XL17uNYi3ojgBhIi11rsGxjRfXu51iLyI7iBMuJaizgZBDdQQrnWYu53V9NaxMkguIGY5VqLfQOZ86uHudYiThHBDcTgwOEjun8wc6GBnz4xpldmtBbvoLWIU0RwAwugUGuRay1iIRHcwEmamj6mX87RWlzXmGktdramdFkDrUUsPIIbKMKh1yb1k8Ex9Q6M6IHBMR2cmKS1iJIjuIECcq3Fvr0jevjpTGtx5dLT1dmaUkc6pfVrV9FaREnxrw2YZWZrsW9gVIMjL0vKtBY/umGNOtK0FlFeBDegTGvxoSf2q3fO1mIrrUUkCsGNRSvXWuwbGNVDT+6ntYhgENxYNPK1FptWnKVbr75AHel6WosIAsGNilaotdiZTuliWosIDMGNipNrLfYNjOjBx19vLW64hNYiKgPBjeDlWou9A6Pq3TOiR559ScdcOnf5mZnWYmtK16yhtYjKQXAjSLnWYl/2qjAzW4ufpLWICkdwIxi0FoEMghuJ9tyLr19rkdYikMG/eCTKfK1FrrUIZBDcKLsTW4tj2n/4CK1FYB4EN8pi5NBr2V+HOqqfPblfR2gtApER3CgJd9eeFw6pd8+o+vaO6FfPv95a/CCtRaAoBYPbzM6U9KCkM7Lb3+Xun4t7MITvyNS0fv70i9nj1bQWgYUSZY/7iKSN7n7YzKolPWRmP3T3n8c8GwL04itHdd/e0Te0FrnWYlh27BrS1p5BDY9PqKG2Rt1dLdrU1ljusZBVMLjd3SUdzn5anf3wOIdCODKtxVcyp+zNaC1yrcVw7dg1pC3bd2ticlqSNDQ+oS3bd0sS4Z0QkY5xm1mVpJ2SLpb0z+7+cKxTIdGmpo+p/5mX1LvnxNbiZQ20FivB1p7B46GdMzE5ra09gwR3QkQKbneflnS5mdVKutvM1rn7YzO3MbPNkjZLUlNT04IPivLKtRb7BkZ0P63FijY8PlHU/Si9os4qcfdxM3tA0g2SHpv12DZJ2ySpvb2dQykVYK7W4orjrcV6rV9bR2uxAjXU1mhojpDmG3NyRDmrpE7SZDa0ayR1SPr72CdDyeVai30DI+rdQ2txseruajnhGLck1VRXqburpYxTYaYou0vnSfpa9jj3aZK+6+73xDsWSmW+1uJnb0yrI51S8ypai4tJ7jg2Z5UkV5SzSn4lqa0Es6BERg69dvzyXbQWMZdNbY0EdYJxgHIRyNdaXL2iRr93dZM60ylai0BACO4KVai12JFOaS2tRSBIBHcFefGVo7p/b+YQCK1FoHIR3AGb2VrsGxjRzmdeby2+t61RnbQWgYpEcAemUGuxI53SukZai0AlI7gDkK+1eM1FK/WRt1+o69MpyhHAIkJwJ1Sh1uLb19ZpGa1FYFHilZ8Q+VqLF9NaBDALwV1GudZi38Co+vaO0loEEAnBXWK0FgGcKoI7ZrQWASw0gjsG87UWu7ta1NlKaxHAySO4F8i811rsuETvvLRedWfTWgRw6gjuk0RrEUC5ENxFmK+1+Mcb16qT1iKAEiC4C5i3tci1FgGUAcE9h3ytxY50Sp2ttBYBlBfpo8KtxY50vdqaaC0CSIZFG9z5rrV4ZfM5tBYBJNqiCu6RQ69lzwIZfb21eMYSvaOlTp2tKVqLAIJQ0cFdqLXYkU7pyuYVOn0JrUUA4ai44M7XWrw821rsSKd0SYrWIoBwVURw01oEsJgEGdwzW4u9e0b0yLMnthY70vV620WraC0CqEgFg9vMVkv6uqRzJR2TtM3dvxj3YLNNTR/TL/e9lDllb0ZrsfW8TGuxI12vdQ1v0mmcsgegwkXZ456S9Kfu/oiZnS1pp5n92N33xDzb8dZi78CIHpjjWosb0yk1LtLW4o5dQ9raM6jh8Qk11Naou6tFm9oayz0WgBIoGNzu/oKkF7K3XzazAUmNkmIJ7rlai+ecVU1rcYYdu4a0ZftuTUxOS5KGxie0ZftuSSK8gUWgqAQ0s2ZJbZIejmOYex/7jT72jZ2SpIvqluoj6y9UZzpFa3GWrT2Dx0M7Z2JyWlt7BgluYBGIHNxmtkzS9yTd4e6H5nh8s6TNktTU1HRSw1x9YeZai9enU7qQ1mJew+MTRd0PoLJEap6YWbUyof1Nd98+1zbuvs3d2929va6u7qSGOWfp6bp9/RpCu4B8v42Q31IILA4Fg9syTZWvSBpw98/HPxIK6e5qUc2sUx1rqqvU3dVSpokAlFKUPe5rJX1I0kYzezT78Z6Y58I8NrU16u/e92Y11tbIJDXW1ujv3vdmjm8Di0SUs0oeksQ7gwmzqa2RoAYWKX67EgAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDBLCm1gZl+VdJOkUXdfF/9IwMLbsWtIW3sGNTw+oYbaGnV3tWhTW2O5xwJOSpQ97n+XdEPMcwCx2bFrSFu279bQ+IRc0tD4hLZs360du4bKPRpwUgoGt7s/KOnFEswCxGJrz6AmJqdPuG9iclpbewbLNBFwahbsGLeZbTazfjPrHxsbW6gvC5yy4fGJou4Hkm7Bgtvdt7l7u7u319XVLdSXBU5ZQ21NUfcDScdZJah43V0tqqmuOuG+muoqdXe1lGki4NQUPKsECF3u7BHOKkGliHI64LclXSdplZk9L+lz7v6VuAcDFtKmtkaCGhWjYHC7+wdKMQgAIBqOcQNAYAhuAAgMwQ0AgSG4ASAwBDcABMbcfeG/qNmYpGdO8q+vkrR/AcdZKMxVHOYqDnMVJ4lznepMF7h7pNp5LMF9Ksys393byz3HbMxVHOYqDnMVJ4lzlXImDpUAQGAIbgAITBKDe1u5B8iDuYrDXMVhruIkca6SzZS4Y9wAgPklcY8bADCPsgW3mX3VzEbN7LE8j5uZ/YOZPWlmvzKzKxIw03VmdtDMHs1+/FXcM2Wfd7WZ3W9mA2b2azP79BzblGO9osxV8jUzszPN7Bdm9r/Zuf56jm3OMLPvZNfrYTNrTshct5nZ2Iz1uj3uubLPW2Vmu8zsnjkeK/laRZyrXGu1z8x2Z5+zf47H438tuntZPiRtkHSFpMfyPP4eST+UZJLeKunhBMx0naR7yrBW50m6Inv7bEmPS2pNwHpFmavka5Zdg2XZ29WSHpb01lnbfFzSl7K3b5H0nYTMdZukfyrDv7HPSPrWXP+vyrFWEecq11rtk7Rqnsdjfy2WbY/bC1+E+L2Svu4ZP5dUa2bnlXmmsnD3F9z9keztlyUNSJr9y6XLsV5R5iq57Boczn5anf2Y/WbOeyV9LXv7LknXm5klYK6SM7PzJd0o6ct5Nin5WkWcK6lify0m+Rh3o6TnZnz+vBIQCpKuyf6o+0Mzu6zUT579MbVNmb21mcq6XvPMJZVhzbI/Yj8qaVTSj90973q5+5Skg5JWJmAuSfrd7I/Yd5nZ6rhnkvQFSXdKOpbn8bKsVYS5pNKvlZT5ZvsjM9tpZpvneDz212KSg3uu7+jl3jt5RJla6lsk/aOkHaV8cjNbJul7ku5w90OzH57jr5RkvQrMVZY1c/dpd79c0vmSrjKzdbM2Kct6RZjrvyU1u/tvSerV63u6sTCzmySNuvvO+Tab475Y1yriXCVdqxmudfcrJL1b0ifMbMOsx2NfryQH9/OSZn4HPV/ScJlmkSS5+6Hcj7ru/gNJ1Wa2qhTPbWbVyoTjN919+xyblGW9Cs1VzjXLPue4pAck3TDroePrZWZLJL1JJTxMlm8udz/g7keyn/6rpN+OeZRrJd1sZvsk/aekjWb2jVnblGOtCs5VhrXKPe9w9s9RSXdLumrWJrG/FpMc3N+X9PvZd2jfKumgu79QzoHM7NzcsT0zu0qZ9TtQguc1SV+RNODun8+zWcnXK8pc5VgzM6szs9rs7RpJHZL2ztrs+5L+IHv7/ZLu8+w7S+Wca9ax0JuVed8gNu6+xd3Pd/dmZd54vM/db521WcnXKspcpV6r7HMuNbOzc7clvUvS7LPQYn8tlu0q7zbHRYiVebNG7v4lST9Q5t3ZJyW9KunDCZjp/ZL+yMymJE1IuiXuf8BZ10r6kKTd2eOjkvQXkppmzFby9Yo4VznW7DxJXzOzKmW+UXzX3e8xs7+R1O/u31fmG85/mNmTyuw93hLzTFHn+pSZ3SxpKjvXbSWY6w0SsFZR5irHWqUk3Z3dF1ki6Vvufq+ZfUwq3WuR5iQABCbJh0oAAHMguAEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACMz/A13F6BU5ePHBAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0xb8bf438>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "X = x[:, np.newaxis]\n",
    "model = LinearRegression().fit(X, y)\n",
    "yfit = model.predict(X)\n",
    "plt.scatter(x, y)\n",
    "plt.plot(x, yfit);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "很显然，我们需要用一个更复杂的模型来描述 x 与 y 的关系。可以对数据进行变换，并增加额外的特征来提升模型的复杂度。例如，可以在数据中增加多项式特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  1.   1.   1.]\n",
      " [  2.   4.   8.]\n",
      " [  3.   9.  27.]\n",
      " [  4.  16.  64.]\n",
      " [  5.  25. 125.]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "poly = PolynomialFeatures(degree=3, include_bias=False)\n",
    "X2 = poly.fit_transform(X)\n",
    "print(X2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在衍生特征矩阵中，第 1 列表示 x，第 2 列表示 x 2 ，第 3 列表示 x 3 。通过对这个扩展的输入矩阵计算线性回归，就可以获得更接近原始数据的结果了:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0xb94e0f0>]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xd8VFXCxvHfSYMEEkJIKCGEUEOvodnFAoKr2NG1F9DdtawuKG59d99XVFZX3V1XELuyKojs6oKAXZRiqKEk9JZQQiAJIX1y3j8SWMRAZiAzd2byfD+ffBhmLjMPB/LMzZl77zHWWkREJHCEOB1AREQ8o+IWEQkwKm4RkQCj4hYRCTAqbhGRAKPiFhEJMCpuEZEAo+IWEQkwKm4RkQAT5o0njY+PtykpKd54ahGRoLR8+fID1toEd7b1SnGnpKSQnp7ujacWEQlKxpgd7m6rqRIRkQCj4hYRCTAqbhGRAKPiFhEJMCpuEZEAo+IWEQkwdRa3MSbVGLPquK9CY8xDvggnIiI/Vudx3NbaLKAfgDEmFMgGPvRyLhGRgDBnZTZPfZLJnoJSWsc05rHLujGmf1uvvqanJ+BcBGyx1rp9oLiISLCaszKbSbMzKKlwAbC3sJRJszMAvFrens5xjwX+6Y0gIiKBZsr8rGOlfVRJhYsp87O8+rpuF7cxJgK4Aph5ksfHGWPSjTHpubm59ZVPRMRv5eSXeHR/ffFkj/syYIW1dl9tD1prp1lr06y1aQkJbl0nRUQkoCVEN6r1/sTYSK++rifFfSOaJhEROaZ5VMSP7osMD2XCiFSvvq5bxW2MiQIuAWZ7NY2ISID4amMuWfsOM6ZfIm1jIzFA29hIJl/d2z+OKrHWFgMtvJpERCRAuKosk+duIDkuiqev7UtEmG/PZdSZkyIiHvpgxW4y9x5m4shUn5c2qLhFRDxSUu7imQVZ9GsXy+jebRzJoOIWEfHAK4u2sq+wjF+P7o4xxpEMKm4RETcdKCrjpa+2cmmPVgxKiXMsh4pbRMRNz3+6iZIKF49e1s3RHCpuERE3bMktYsayndw0OJlOCU0dzaLiFhFxw1PzMokMD+XBi7s4HUXFLSJSl++3H2TB+n3ce35H4pvWfpq7L6m4RUROwVrLE3M30DqmMXed09HpOICKW0TklOZm7GXlznwevrQrkRGhTscBVNwiIidVXlnF0/Mz6dY6mmsGJDkd5xgVt4jISby9ZAc78oqZNKo7oSHOnGxTGxW3iEgtCkoqeOHzTZzbJZ7zu/rXGgMqbhGRWrz45WYKSip4zOGTbWqj4hYROcHuQ8W89u12rurflp6JzZyO8yMqbhGREzyzYCMG+NWl3l3J5nSpuEVEjrM2u4APV2Zz5zkdvL525OlScYuI1Dh6sk1ckwjuu6CT03FOSsUtIlLjy6xcvtuSxwPDOxPTONzpOCel4hYRASpdVUyet4GUFlHcNKS903FOScUtIgLMWr6bjfuKeHRkN0fWkfSEf6cTEfGB4vJKnl24kYHtmzOyV2un49RJxS0iDd7LX29j/+EyHh/VzbF1JD2h4haRBm3/4VKmfr2Fy3q1ZmB759aR9IRbxW2MiTXGzDLGZBpjNhhjhnk7mIiILzz36SbKK6uYONL/Tm0/mTA3t3se+MRae60xJgKI8mImERGf2Lz/MO99v4tbhranQ3wTp+O4rc7iNsbEAOcBtwNYa8uBcu/GEhHxvifnZRIVHsoDFzm/jqQn3Jkq6QjkAq8ZY1YaY6YbY3701mSMGWeMSTfGpOfm5tZ7UBGR+rRkax6fbtjPfRd2Iq5JhNNxPOJOcYcBA4B/WGv7A0eAx07cyFo7zVqbZq1NS0jwr2vXiogcr6qq+tT2xGaNufPsDk7H8Zg7xb0b2G2tXVrz+1lUF7mISED6aE0Oa3YX8MilqTQO9491JD1RZ3Fba/cCu4wxR69veBGw3qupRES8pKzSxZT5WfRoE8NV/ds6Hee0uHtUyf3AOzVHlGwF7vBeJBER73nzux3sPlTC23f1IcSP1pH0hFvFba1dBaR5OYuIiFflF5fz1883cX7XBM7pEu90nNOmMydFpMH42+ebKSqrZNKowDnZpjYqbhFpEHYdLObNxTu4dmAS3VrHOB3njKi4RaRBeHp+FiEh8PAl/rmOpCdU3CIS9Fbvyuej1Tncc25HWjdr7HScM6biFpGgZq3l/+ZuIL5pBOPP9991JD2h4haRoPbphv0s23aQBy/uStNG7h4B7d9U3CIStCpdVTw5bwMdE5owdlA7p+PUGxW3iAStd7/fxZbcIzw2shvhocFTd8HzNxEROU5RWSXPfbqRwSlxXNKjldNx6lVwTPiIiJxg2ldbOFBUzvTbugfEOpKe0B63iASdfYWlvPzNNi7v04Z+7WKdjlPvVNwiEnSeXbCRyqoqJo4I7FPbT0bFLSJBJWvvYWYu38Wtw1JIbhGcy+OquEUkqEyet4GmjcK4f3hnp6N4jYpbRILGt5sP8GVWLr8Y3pnYqMBaR9ITKm4RCQpH15FsGxvJrcNSnI7jVSpuEQkKc1Zlsy6nkIkjA3MdSU+ouEUk4JVWuPjz/Cx6t23GT/okOh3H61TcIhLwXvt2OzkFpTw+qnvAriPpCRW3iAS0g0fKefGLzVzUrSXDOrVwOo5PqLhFJKC98NkmjpRX8thlwXmyTW1U3CISsLYfOMLbS3Zww6BkurSKdjqOz6i4RSRgPT0/k4iwEH55SReno/iUW1cHNMZsBw4DLqDSWpvmzVAiInVZvuMQczP28tDFXWgZHfjrSHrCk8u6XmitPeC1JCIibrK2+mSbhOhG3HNuR6fj+JymSkQk4Mxft5flOw7x8CVdaRIk60h6wt3itsACY8xyY8w4bwYSETmVClcVT32SRZeWTbluYJLTcRzh7lvV2dbaHGNMS2ChMSbTWvv18RvUFPo4gOTk5HqOKSJSbcbSnWw7cIRXb08jLIjWkfSEW39ra21Oza/7gQ+BwbVsM81am2atTUtISKjflCIiQGFpBc9/tolhHVtwYWpLp+M4ps7iNsY0McZEH70NXAqs9XYwEZETvfTlFg4eKefxUcG3jqQn3JkqaQV8WDNIYcAMa+0nXk0lInKCnPwSXlm0jTH9Eumd1MzpOI6qs7ittVuBvj7IIiJyUs8s2IgFfjUi1ekojmuYM/siElDW5xQye+Vu7jgrhaTmwbmOpCdU3CLi9ybP20CzyHB+dmHwriPpCRW3iPi1rzbm8s2mA9w/vAvNIsOdjuMXVNwi4rdcVZbJczeQHBfFLUPbOx3Hb6i4RcRvfbBiN5l7DzNxZCoRYaqrozQSIuKXSspdPLMgi37tYhndu43TcfyKiltE/NIri7ayr7CMX49u2Cfb1EbFLSJ+50BRGS99tZVLe7RiUEqc03H8jopbRPzO859uoqTCxaMNaB1JT6i4RcSvbMktYsayndw0OJlOCU2djuOXVNwi4leempdJZHgoD17csNaR9ISKW0T8xrJtB1mwfh/3nt+R+KaNnI7jt1TcIuIXjq4j2TqmMXed0/DWkfSEiltE/MJ/Mvawalc+D1/alciIUKfj+DUVt4g4rqzSxdOfZNGtdTTXDGiY60h6QsUtIo57e8lOdh4sZtKo7oSG6GSbuqi4RcRRBSUV/PXzTZzbJZ7zu2q9WneouEXEUS9+sZmCkgomXdbd6SgBQ8UtIo7ZfaiY177bztX9k+iRGON0nICh4hYRx/x5fhYG+NWIrk5HCSgqbhFxRMbuAuasyuGuczrQplmk03ECiopbRHzu6Mk2cU0iuPeCTk7HCTgqbhHxuS+y9rN4ax4PXtSFmMZaR9JTKm4R8alKVxWT52bSIb4JNw1JdjpOQHK7uI0xocaYlcaYj70ZSESC28zlu9m0v4hHR6YSHqp9x9MR5sG2DwIbAK8cszNnZTZPzstkb2EprWMa89hl3RjTv603XkpEHHKkrJJnF24krX1zRvRs7XScgOXW250xJgkYDUz3Rog5K7OZNDuDvYWlAOwtLOWxD9YwZ2W2N15ORBzy8jdbyT1cxqRRWkfyTLj7c8pzwESgyhshpszPoqTC9YP7SiurmDxvgzdeTkQcsP9wKdO+3sqo3q0Z2L6503ECWp3FbYy5HNhvrV1ex3bjjDHpxpj03Nxcj0Lk5JfUev++wjK2Hzji0XOJiH/6y8JNVLiqmDhC60ieKXf2uM8GrjDGbAfeBYYbY94+cSNr7TRrbZq1Ni0hwbMLxSTG1n7wfYiB66YuZuO+wx49n4j4l037DvPe9zv56ZD2pMQ3cTpOwKuzuK21k6y1SdbaFGAs8Lm19ub6DDFhRCqR4T+8cHpkeCgTR3TDADdMXcza7IL6fEkR8aEn52XSJCKMBy7SOpL1wS+OxRnTvy2Tr+5N29hIDNA2NpLJV/fm3gs68f74YURFhHHjy0tYvuOg01FFxEOLt+TxWeZ+fnZhZ+KaRDgdJygYa229P2laWppNT0+vt+fLzi/h5ulL2VdYyvRb0zirc3y9PbeIeE9VleXKv39LXlEZn//qAhqHa0mykzHGLLfWprmzrV/scdelbWwk740fSrvmUdz++vd8nrnP6Ugi4oaP1uSQkV3Ar0akqrTrUUAUN0DL6Ma8O24oqa2iGf/WcuZm7HE6koicQmlF9TqSPRNjGNNPJ9PVp4ApboDmTSJ4554h9E2K5RczVvDB8t1ORxKRk3hz8Xay80t4fFR3QrSOZL0KqOIGiGkczpt3DeasTvE8MnM1by3Z4XQkETlBfnE5f/t8MxekJnC2PpOqdwFX3ABREWFMvy2Ni7q15Ldz1jLt6y1ORxKR4/z1880UlVVqHUkvCcjiBmgcHspLtwxkdJ82PDE3k78s3Ig3jpAREc/szCvmzcXbuW5gO1JbRzsdJyh5cnVAvxMeGsILY/sTGR7K859tori8ksd18RoRRz09P5OwkBAevlTrSHpLQBc3QGiI4elr+tAkIpSXv9lGcbmLP13ZSx+GiDhg1a58Pl6zhweGd6ZVTGOn4wStgC9ugJAQwx+u6ElkRBgvfbWFknIXT1/bhzBdpF3EZ6y1PPGfDcQ3jWDc+VpH0puCorgBjDE8OjKVJhGhPLNwIyUVLp4f25+IMJW3iC8sXL+PZdsP8r9jetG0UdBUi18KqlYzxnD/RV34zejuzFu7l/FvpVN6wnW+RaT+VbiqePKTTDolNGHsoHZOxwl6QVXcR919bkeeuKo3X27M5Y7XvudIWaXTkUSC2rvf72Jr7hEeu6y7pih9IGhH+KYhyTx7fV+WbT/ILa8spaCkwulIIkGpqKyS5z/dyOAOcVzcvaXTcRqEoC1ugKv6J/H3mwaQkV3AjdOWkFdU5nQkkaAz9astHCgq59c6FNdngrq4AUb2as3Lt6axJbeIsdOWsK9mQWIROXN7C0p5+Zut/KRvIn3bxTodp8EI+uIGuCC1JW/cOZic/BKun7qY3YeKnY4kEhSeXZhFVRVMHJHqdJQGpUEUN8DQji14++4hHDpSzvUvLWZrbpHTkUQCWubeQmYu382tw9rTLi7K6TgNSoMpboD+yc15d9wwyiqruH7qErL2ahFikdM1eW4m0Y3C+MXwzk5HaXAaVHED9EiM4b3xQwkNgRumLWbN7nynI4kEnEWbDvDVxlzuH96F2CitI+lrDa64ATq3jGbm+LNo2iiMm15eyvfbtQixiLuqqixPzN1AUvNIbj2rvdNxGqQGWdwAyS2imHnvMFpGN+LWV5axaNMBpyOJBIQPV2azfk8hE0ak0ihM60g6ocEWN0CbZpG8N34Y7VtEcefr3/Ppei1CLHIqpRUunlmQRZ+kZvykT6LTcRqsBl3cAAnRjXh33FC6t4nm3reX89HqHKcjifitV7/dRk5BqdaRdFiDL26A2KgI3r57CAOSm/Pguyt5P32X05FE/E5eURn/+GILF3dvydCOLZyO06DVee1FY0xj4GugUc32s6y1v/d2MF+LbhzOG3cOZtxb6UyctYaSche3nZXidCwRR8xZmc2U+Vnk5JeQGBvJhBGprNqVT3GFi8cu6+Z0vAbPnYvmlgHDrbVFxphwYJExZp61domXs/lcZEQo029L4xczVvL7f6+juNzFfRfogvDSsMxZmc2k2RmU1FwSOTu/hEc/WEOFq4qxg5Pp3FLrSDqtzqkSW+3oaYbhNV9Buypvo7BQXvzpAK7om8hTn2TyzIIsLUIsDcqU+VnHSvuossoqrIWHLu7iUCo5nlvLVBhjQoHlQGfg79bapbVsMw4YB5CcnFyfGX0uPDSEv9zQj6iIUP76+WaKy138ZrSufCYNQ05+Sa33W6BltNaR9AdufThprXVZa/sBScBgY0yvWraZZq1Ns9amJSQk1HdOnwsNMUy+ujd3nJ3CK4u28fiHGbiqtOctwS8xNrLW+9s0U2n7C4+OKrHW5gNfAiO9ksbPGGP43eU9+PmFnfjnsl088v4qKl1VTscS8aoJI1KJDP/hiTXhoYZHR+pDSX/hzlElCUCFtTbfGBMJXAw85fVkfsIYw4QR3YiKCDs29/fCjf11xpgErTH92wLw9CeZ5BSUEhZieOrqPsfuF+e5s8fdBvjCGLMG+B5YaK392Lux/M/PL+zM73/Sg/nr9jHuzeWUlGsRYgleY/q3Zdx5HQF4+dY0rh6Y5HAiOV6de9zW2jVAfx9k8Xt3nN2BqIhQHpudwe2vLeOV2wfRtJFbn++KBJTPM/fx7MKNnNWpBRekBv5nVsFGZ0566IZByTx3Qz/Sdxzip9OXUlCsRYgleBSWVjBh5mrufD2dxNhInriqt46m8kPaXTwNV/ZrS2R4KL+YsZKxLy/hrbsGE9+0kdOxRM7Iok0HmDhrNXsLS/n5hZ144KIu+izHT2mP+zRd2rM1029LY9uBIm6Yupi9BVqEWALTkbJKfjMng5tfWUrjiFA+uO8sJozoptL2YyruM3Be1wTevHMI+wrLuG7qd+w6qEWIJbAs3ZrHZc9/wztLd3L3OR2Y+8C59E9u7nQsqYOK+wwN7hDHO3cPobCkkuteWswWLUIsAaC0wsWfPl7P2JerLzn03rhh/ObyHjQO1152IFBx14O+7WJ5d9xQKququGHqYjbsKXQ6kshJrdx5iFEvfMMri7Zx85D2zHvwXAZ3iHM6lnhAxV1PureJ4b3xwwgPDWHstCWs2qVFiMW/lFW6ePqTTK75x3eUlrt4+64h/GlML5rokNaAo+KuR50SmvL++GE0iwzn5ulLWbo1z+lIIgCszS7gyr99y4tfbuHagUl88svzOKdLvNOx5DSpuOtZu7go3h8/jFYxjbjttWV8tTHX6UjSgFW4qnj+002M+fu35B0p59Xb03j62r7ENA53OpqcARW3F7Ru1pj3xw+jY3xT7nkjnfnr9jodSRqgjfsOc/WL3/GXTzcyuk8bFv7yPIZ3a+V0LKkHKm4vadG0Ef+8Zyg9EmP42Tsr+NeqbKcjSQPhqrK89NUWLn9hEdn5JfzjpwN4fmx/YqMinI4m9USfSnhRs6hw3r57CHe/8T0PvbeKknIXYwcH9iIT4t+25hbxq5mrWbEzn5E9W/O/V/XSWb1BSMXtZU0bhfH6HYO59+3lPDY7g+JyF3ee08HpWBJkqqosbyzezlOfZBIRGsJzN/Tjyn6Jus5IkFJx+0Dj8FCm3jKQB/+5ij9+vJ6SChc/v7Cz07EkSOw6WMyEWatZsvUgF6Ym8OQ1fWgVo9VqgpmK20cahYXyt5v6M2HWGqbMz+JIWSUTRqRqj0hOm7WWfy7bxf/9Zz3GGJ66pjfXp7XT/6kGQMXtQ2GhITxzXV8iI0J58cstFJe7+N3lPQgJ0TeaeGZPQQmPfpDB1xtzObtzC566pg9JzaOcjiU+ouL2sZAQw/+N6UVkeCivLNpGSbmLJ67uTajKW9xgrWX2imz+8NE6Kl2WP13Zk58Oaa83/wZGxe0AYwy/Gd2dJo3CeOGzTRRXuHj2+r6Eh+roTDm5/YdLeXz2Wj7dsI9BKc2Zcm1fUuKbOB1LHKDidogxhocv6UpURChPzsukpNzF327qr6uzSa0+XpPDb+es5Ui5i9+M7s4dZ3fQT2kNmIrbYfee34moiFB+96913PNmOlNvGUhUhP5ZpNrBI+X89l9r+c+aPfRNasYz1/elc8top2OJw9QQfuDWYSlEhofy6AdruO3VZbx6+yCidS2JBm/h+n1Mmp1BQUk5E0akMv68joRpOk1QcfuN69LaERkRykPvruKn05fy5p2DdYpyA1VQUsH/fLSO2Suy6d4mhjfvHEyPxBinY4kfUXH7kcv7JBIZHsp976xg7LQlvHXXEBKidbpyQ/LVxlwenbWG3KIy7h/emfuHdyEiTHvZ8kN1/o8wxrQzxnxhjNlgjFlnjHnQF8Eaqou6t+K12wexI6+YG6YuJie/xOlI4gNFZZVMmp3Bba8uo2njMGbfdxaPXJqq0pZaGWvtqTcwpg3Qxlq7whgTDSwHxlhr15/sz6Slpdn09PT6TdrApG8/yB2vfU9MZDgz7hlC+xY/POxrzspspszPIie/hMTYSCaMSGVM/7YOpZUzsXhLHhNmrSY7v4Rx53bkl5d01dFFDZAxZrm1Ns2dbet8O7fW7rHWrqi5fRjYAKghvCwtJY4Z9wzlSHkl109dzOb9h489NmdlNpNmZ5CdX4IFsvNLmDQ7gzkrdenYQFJS7uIP/17HjS8vISzEMHP8MCaN6q7Sljp59HOYMSYF6A8s9UYY+aHeSc14b9wwqixcP3UJ63IKAJgyP4uSCtcPti2pcDFlfpYTMeU0LN9xkFEvfMPr323n9rNSmPvguaSlaMFecY/bxW2MaQp8ADxkrf3RMubGmHHGmHRjTHpurpbrqi+praN5f/wwGoeFcOO0JazYeeik896aD/d/pRUuJs/bwHUvLaa8sooZdw/hD1f01LH74pE657gBjDHhwMfAfGvts3Vtrznu+rf7UDE3T1/K/sNlRIWHcuBI+Y+2aRsbybePDXcgnbgjY3cBD7+/ik37i7hxcDseH9Vdx+vLMfU6x22qrxH5CrDBndIW70hqXr0IcdvYSApKK4g44USMyPBQJoxIdSidnEp5ZRXPLtzImBe/pbC0gtfuGMTkq/uotOW0uTNVcjZwCzDcGLOq5muUl3NJLVrGNOa98cPo2ioal7XERUVgqN7Tnnx1bx1V4ocy9xZy1Yvf8sJnm7iybyILHjqfC1NbOh1LAlydE2vW2kWArmbjJ+KaRDDjnqHc+fr3rNqVzx/H9GJMv0TtvfmZSlcVU7/eynOfbqRZZDhTbxnIiJ6tnY4lQcKtOW5PaY7b+46UVXLPm+l8tyWPEAO92jZjSIc4hnRowaAOcTSLVJE7ZUtuEY+8v5pVu/IZ3bsNf7yyJy20YK/UwZM5bhV3AKt0VbFk60GWbstj6daDrNqVT7mrCmOge+sYhnSsLvIhHeJo3kTXPfG2qirLq99uY8r8LCIjQvnTlb34Sd9Ep2NJgFBxN1ClFS5W7sw/VuQrdh6irLIKgNRW0f8t8o5xxGsPsF7tyDvChJlrWLb9IBd3b8kTV/emZbQW7BX3qbgFgLJKF2t2F7B0ax5Ltx0kffuhYyfudEpowpCO1XvjQzu20Krgp8lay9tLdzJ57gZCjeH3V/TkmgFttWCveEzFLbWqcFWRkV3A0prplfTthygqqwQgpUUUQzu2OLZXnhgb6XBa/5edX8Kjs9awaPMBzu0Sz1PX9NG4yWlTcYtbKl1VrN9TeKzIl207SGFpdZG3i4s8Nj8+tGMLkppHai+yhrWWmct386eP1uOyll+P7s5Ng5M1PnJGVNxyWlxVlsy9PyzyQ8UVACQ2a3xsamVIxxaktIhqkEW1v7CUSbMz+CxzP4M7xPHna/uS3CLK6VgSBFTcUi+qqiyb9hcd+7Bz6bY8DhRVn2rfMrrRcXPkcXRKaBrURW6t5d+rc/jdv9ZRWuFi4shu3HFWCiFasFfqiYpbvMJay5bcIyzdlld9GOLWPPYfLgMgvmkEgzv896iVri2jg6bU8orK+O2/1jI3Yy/9k2P583V96ZTQ1OlYEmRU3OIT1lq25xUfO2pl6dY8cgpKAWgeFc6glLhje+Xd28QQGoBF/snavfz6wwwOl1by0CVdGHeuFuwV7/CkuHUtSTltxhg6xDehQ3wTxg5OxlrL7kMlLDla5NvyWLB+HwAxjcNqirx6r7xnYoxfF2BBcQV/+GgdH67MpmdiDDPu6Udq62inY4kAKm6pR8YY2sVF0S4uiuvS2gHV1wj/7xz5QT7L3A9A00ZhDGzf/FiR90lqRrifFPkXWft57IM15BWV89DFXfj5hZ39JpsIaKpEfGxfYemxaZWl2w6yeX8RUH1Z2rSU5seOWumT1IxGYb5dwutwaQX/+/EG3kvfRddWTXn2+n70atvMpxmk4dIctwSMA0VlLDuuyDP3Vq+t2SgshAHJ/90j758c69W1GL/dfICJs9awp6CE8ed34qGLu/j8jUMaNhW3BKxDR8pZtv3gscMP1+8pxFqICA2hX7vYY0U+oH1svSz3VVxeyZPzMnlz8Q46xjfhz9f3ZUBy83r4m4h4RsUtQaOgpIL07QePTa+szSnEVWUJCzH0SWp27KiVtJQ4mjbyrMjTtx/kkZmr2ZFXzJ1nd2DCiFQiI7SXLc5QcUvQKiqr/EGRr9ldQGWVJTTE0Csx5gdFfrJrkpdWuHhmQRbTF20jqXkkU67ty9COLXz8NxH5IRW3NBjF5ZWs2JFf6zXJe7SJOXZC0JAOccRGRbB6Vz6PzFzN5v1F3DQkmcdHdfd4T13EG1Tc0mCd6prkXVo2ZeuBI7SMbsRT1/ThvK4JDqcV+S+dgCMNVuPwUIZ1asGwTtVTH8dfk3zZ9kMM6RjHhBHdtLSbBDQVtwS1RmGhDEqJY1BKnNNRROqNTgcTEQkwKm4RkQCj4hYRCTB1znEbY14FLgf2W2t7eT+SSP2bszKbKfOzyMkvITE2kgkjUhnTv63TsUROizt73K8DI72cQ8Rr5qzMZtLsDLLzS7BUL/I7aXYGc1ZmOx1N5LTUWdzW2q+Bgz7IIuIVU+ZnUVLh+sF9JRUupszPciiRyJmptzluY8yI+EJ7AAAFFUlEQVQ4Y0y6MSY9Nze3vp5W5Izl5Jd4dL+Iv6u34rbWTrPWpllr0xISdEaa+I/E2EiP7hfxdzqqRILehBGpRJ5wLe/I8FAmjEh1KJHImdGZkxL0jh49oqNKJFi4czjgP4ELgHhjzG7g99baV7wdTKQ+jenfVkUtQaPO4rbW3uiLICIi4h7NcYuIBBgVt4hIgFFxi4gEGBW3iEiAUXGLiAQYr6w5aYzJBXac5h+PBw7UY5z6olyeUS7PKJdn/DHXmWZqb61167RzrxT3mTDGpLu7YKYvKZdnlMszyuUZf8zly0yaKhERCTAqbhGRAOOPxT3N6QAnoVyeUS7PKJdn/DGXzzL53Ry3iIicmj/ucYuIyCk4VtzGmFeNMfuNMWtP8rgxxrxgjNlsjFljjBngB5kuMMYUGGNW1Xz9ztuZal63nTHmC2PMBmPMOmPMg7Vs48R4uZPL52NmjGlsjFlmjFldk+t/atmmkTHmvZrxWmqMSfGTXLcbY3KPG6+7vZ2r5nVDjTErjTEf1/KYz8fKzVxOjdV2Y0xGzWum1/K4978XrbWOfAHnAQOAtSd5fBQwDzDAUGCpH2S6APjYgbFqAwyouR0NbAR6+MF4uZPL52NWMwZNa26HA0uBoSds8zPgpZrbY4H3/CTX7cDfHPg/9jAwo7Z/KyfGys1cTo3VdiD+FI97/XvRsT1uW/cixFcCb9pqS4BYY0wbhzM5wlq7x1q7oub2YWADcOLFpZ0YL3dy+VzNGBTV/Da85uvED3OuBN6ouT0LuMgYY/wgl88ZY5KA0cD0k2zi87FyM5e/8vr3oj/PcbcFdh33+934QSkAw2p+1J1njOnp6xev+TG1P9V7a8dzdLxOkQscGLOaH7FXAfuBhdbak46XtbYSKABa+EEugGtqfsSeZYxp5+1MwHPARKDqJI87MlZu5ALfjxVUv9kuMMYsN8aMq+Vxr38v+nNx1/aO7vTeyQqqT0vtC/wVmOPLFzfGNAU+AB6y1hae+HAtf8Qn41VHLkfGzFrrstb2A5KAwcaYXids4sh4uZHrIyDFWtsH+JT/7ul6hTHmcmC/tXb5qTar5T6vjpWbuXw6Vsc521o7ALgM+Lkx5rwTHvf6ePlzce8Gjn8HTQJyHMoCgLW28OiPutbauUC4MSbeF69tjAmnuhzfsdbOrmUTR8arrlxOjlnNa+YDXwIjT3jo2HgZY8KAZvhwmuxkuay1edbasprfvgwM9HKUs4ErjDHbgXeB4caYt0/YxomxqjOXA2N19HVzan7dD3wIDD5hE69/L/pzcf8buLXmE9qhQIG1do+TgYwxrY/O7RljBlM9fnk+eF0DvAJssNY+e5LNfD5e7uRyYsyMMQnGmNia25HAxUDmCZv9G7it5va1wOe25pMlJ3OdMBd6BdWfG3iNtXaStTbJWptC9QePn1trbz5hM5+PlTu5fD1WNa/ZxBgTffQ2cClw4lFoXv9edGyVd1PLIsRUf1iDtfYlYC7Vn85uBoqBO/wg07XAfcaYSqAEGOvt/8A1zgZuATJq5kcBHgeSj8vm8/FyM5cTY9YGeMMYE0r1G8X71tqPjTF/BNKttf+m+g3nLWPMZqr3Hsd6OZO7uR4wxlwBVNbkut0HuX7ED8bKnVxOjFUr4MOafZEwYIa19hNjzL3gu+9FnTkpIhJg/HmqREREaqHiFhEJMCpuEZEAo+IWEQkwKm4RkQCj4hYRCTAqbhGRAKPiFhEJMP8P96gWxowg7S8AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0xb654630>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "model = LinearRegression().fit(X2, y)\n",
    "yfit = model.predict(X2)\n",
    "plt.scatter(x, y)\n",
    "plt.plot(x, yfit)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种不通过改变模型，而是通过变换输入来改善模型效果的理念，正是许多更强大的机器学习方法的基础。5.6 节介绍基函数回归时将详细介绍这个理念，它通常被认为是强大的核方法（kernel method，5.7 节将详细介绍）技术的驱动力之一。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.5缺失值填充  \n",
    "特征工程中还有一种常见需求是处理缺失值。我们在 3.5 节介绍过 DataFrame 的缺失值处理方法，也看到了 NaN 通常用来表示缺失值。例如，有如下一个数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[nan  0.  3.]\n",
      " [ 3.  7.  9.]\n",
      " [ 3.  5.  2.]\n",
      " [ 4. nan  6.]\n",
      " [ 8.  8.  1.]]\n"
     ]
    }
   ],
   "source": [
    "from numpy import nan\n",
    "X = np.array([[ nan, 0, 3 ],\n",
    "[ 3, 7, 9 ],\n",
    "[ 3, 5, 2 ],\n",
    "[ 4, nan, 6 ],\n",
    "[ 8, 8, 1 ]])\n",
    "y = np.array([14, 16, -1, 8, -5])\n",
    "print(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当将一个普通的机器学习模型应用到这份数据时，首先需要用适当的值替换这些缺失数据。这个操作被称为缺失值填充，相应的策略很多，有的简单（例如用列均值替换缺失值），有的复杂（例如用矩阵填充或其他模型来处理缺失值）。  \n",
    "  \n",
    "复杂方法在不同的应用中各不相同，这里不再深入介绍。对于一般的填充方法，如均值、\n",
    "中位数、众数，Scikit-Learn 有 Imputer 类可以实现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[4.5, 0. , 3. ],\n",
       "       [3. , 7. , 9. ],\n",
       "       [3. , 5. , 2. ],\n",
       "       [4. , 5. , 6. ],\n",
       "       [8. , 8. , 1. ]])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import Imputer\n",
    "imp = Imputer(strategy='mean')\n",
    "X2 = imp.fit_transform(X)\n",
    "X2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果矩阵中的两处缺失值都被所在列剩余数据的均值替代了。这个被填充的数据就可以直接放到评估器里训练了，例如 LinearRegression 评估器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LinearRegression().fit(X2, y)\n",
    "model.predict(X2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.4.6特征管道  \n",
    "如果经常需要手动应用前文介绍的任意一种方法，你很快就会感到厌倦，尤其是当你需要将多个步骤串起来使用时。例如，我们可能需要对一些数据做如下操作。  \n",
    "(1) 用中位数填充缺失值。  \n",
    "(2) 将衍生特征转换为二次方。  \n",
    "(3) 拟合线性回归模型。  \n",
    "为了实现这种管道处理过程，Scikit-Learn 提供了一个管道对象，如下所示：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "model = make_pipeline(Imputer(strategy='median'),  # (1) 用中位数填充缺失值\n",
    "                      PolynomialFeatures(degree=2),  # (2) 将衍生特征转换为二次方\n",
    "                     LinearRegression()  # (3) 拟合线性回归模型。 \n",
    "                     ) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个管道看起来就像一个标准的 Scikit-Learn 对象，可以对任何输入数据进行所有步骤的处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[14 16 -1  8 -5]\n",
      "[14. 16. -1.  8. -5.]\n"
     ]
    }
   ],
   "source": [
    "model.fit(X, y) # # 和上面一样，X带有缺失值\n",
    "print(y)\n",
    "print(model.predict(X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这样的话，所有的步骤都会自动完成。请注意，出于简化演示考虑，将模型应用到已经训练过的数据上，模型能够非常完美地预测结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
